Feature Selection: An Exploration of Algorithm Performance

Jason Case, Abhishek Dugar, Daniel Nkuah, Khoa Tran

2024-07-30

Introduction

What is feature selection? Why do we care?

Introduction: Early Techniques

  • Forward, backward, and stepwise variable selection in linear models

  • Univariate screening procedures

Introduction: Modern Techniques

Advanced feature selection methods:

  • similarity-based
  • information-theoretical-based
  • sparse-learning-based
  • statistical-based

Classification of methods

  • Filter
  • Wrapper
  • Embedded
  • Hybrid

Methods: Correlation-Based Feature Selection (CFS)

  • Filter method
  • Uses the correlation coefficient to measure the relationship between each variable and the target independently.
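The idea can be sketched as a simple correlation filter: score each feature by the absolute value of its correlation with the target and keep those above a threshold. The data, the 0.3 cutoff, and the use of NumPy here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Synthetic data: only features 0 and 1 actually drive the target.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Score each feature by |corr(feature, target)|, computed independently.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Keep features above an assumed threshold (a tuning parameter in practice).
selected = np.where(scores > 0.3)[0]
print(selected)  # the two informative features
```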

Methods: Recursive Feature Elimination (RFE)

  • Wrapper method
  • Removes variables iteratively.

Steps:

  1. Train the classifier
  2. Compute the ranking criterion for all features
  3. Remove the features with the smallest ranking values

Methods: Least Absolute Shrinkage and Selection Operator (LASSO)

  • Embedded method
  • Uses L1 regularization to shrink some coefficients exactly to zero.
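A minimal sketch of the embedded approach, assuming an L1-penalized logistic regression (since the classification tasks here are binary): features whose coefficients are driven to zero are effectively deselected. The penalty strength C is an assumed value chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

# liblinear supports the L1 penalty; smaller C means stronger shrinkage.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

# Features with nonzero coefficients are the ones the model "selects".
n_selected = int(np.sum(clf.coef_ != 0))
print(n_selected)
```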

Methods: CFS & RFE

  • Hybrid method
  • Combines filter and wrapper methods

Step 1: filter using the correlation coefficient (e.g. trim the "fat").
Step 2: remove variables iteratively using RFE.
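The two steps can be chained as follows; the synthetic data, the 20% filter fraction, and the final subset size are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Step 1: rank features by |correlation with y| and drop the weakest 20%.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
keep = np.argsort(scores)[int(0.2 * X.shape[1]):]

# Step 2: run RFE on the filtered feature set only.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X[:, keep], y)
selected = keep[rfe.support_]  # indices back in the original feature space
print(len(selected))
```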

Analysis: Spambase Dataset

  • 4601 instances of emails
  • 57 features for classification tasks
  • Binary classification: email is spam (1) or not (0)
  • 80/20 train/test split
  • Increased number of features by adding all two-way interactions.
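One way to generate all two-way interactions is scikit-learn's PolynomialFeatures with interaction_only=True (an assumed implementation choice): 57 original features plus C(57, 2) = 1596 pairwise products gives the 1653 features seen in the baseline below.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Dummy data with 57 columns, matching the Spambase feature count.
X = np.random.default_rng(0).normal(size=(10, 57))

# interaction_only=True adds x_i * x_j products but no squared terms.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_expanded = poly.fit_transform(X)
print(X_expanded.shape[1])  # 57 + 57*56/2 = 1653
```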

Analysis: COVID-19 NLP Text Classification Dataset

  • 45k tweets related to COVID-19, labeled for sentiment analysis.
  • Five sentiment label classes, ranging from extremely positive to extremely negative
  • Recoded to a binary classification task, positive (1) or negative (0)
  • 33,444 (91%) training and 3,179 (9%) testing records after recoding

“Bag of words”

  • text data converted into a matrix of word frequencies
  • each row represents a document
  • each column represents a unique word from the entire corpus
  • Large, sparse feature set
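A toy bag-of-words construction, assuming scikit-learn's CountVectorizer (one common choice); the example documents stand in for tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for tweets.
docs = ["masks work", "masks do not work", "stay home stay safe"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term count matrix
print(X.shape)  # (number of documents, number of unique words in the corpus)
```

Each row is one document and each column one vocabulary word, so the matrix is mostly zeros once the corpus is large.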

Statistical Modeling

Three metrics:

  • Accuracy on the test set

  • Difference between accuracy on the training and test sets (overtraining)

  • Number of variables selected (model complexity)

  • Baseline: full logistic model with no feature selection
  • CFS: select the best of 20 correlation thresholds using cross-validation
  • RFE: select the best of 20 subset sizes using cross-validation
  • LASSO: select the best penalty term using cross-validation
  • CFS + RFE: remove the 20% of variables with the lowest correlation, then select the best of 20 subset sizes using cross-validation
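The cross-validated threshold search for CFS can be sketched as below; the 5-point grid (instead of 20), the synthetic data, and the logistic model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Try each threshold; keep the one with the best cross-validated accuracy.
best_thresh, best_acc = None, -np.inf
for t in np.linspace(0.0, 0.3, 5):
    keep = scores > t
    if keep.sum() == 0:
        continue  # skip thresholds that discard every feature
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, keep], y, cv=5).mean()
    if acc > best_acc:
        best_thresh, best_acc = t, acc
print(best_thresh, best_acc)
```

The same loop structure applies to the RFE subset sizes and the LASSO penalty grid.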

Results: Spambase Dataset

Method      Number of Features   Test Accuracy   Accuracy Decrease from Train
Baseline                  1653           0.915                          0.070
CFS                        126           0.936                         -0.004
RFE                       1144           0.911                          0.067
LASSO                       50           0.903                         -0.006
CFS + RFE                  683           0.920                          0.045

Results: COVID-19 NLP Text Classification Dataset

Method      Number of Features   Test Accuracy   Accuracy Decrease from Train
Baseline                  4820           0.846                          0.097
CFS                       1449           0.868                          0.042
RFE                       2980           0.863                          0.059
LASSO                     4472           0.880                          0.045
CFS + RFE                 2672           0.778                          0.088

Results: Records per Feature

Summary

Conclusion